Given a training set $\mathcal{D} = \{(\mathbf{x}_i, y_i): i \in N = \{1, \ldots, n\}\}$, where
each observation $\mathbf{x}_i \equiv (x_{i1},\ldots,x_{ip})^\intercal$ has $p$ attributes and
the target values form the vector $\mathbf{y} \equiv (y_1,\ldots,y_n)^\intercal$,
we fit the data using a linear model
\begin{align} f(\mathbf{x}, \mathbf{w}) = w_0 + w_1 \phi_1(\mathbf{x}) + w_2 \phi_2(\mathbf{x}) + \cdots + w_{m-1} \phi_{m-1}(\mathbf{x}) \nonumber \end{align}where the $\phi_j$ are transformations of $\mathbf{x}$ called basis functions. We assume that the target value $y$ is given by a deterministic function $f(\mathbf{x}, \mathbf{w})$ plus additive Gaussian noise $\epsilon \sim \mathcal{N}(0,\sigma^2)$, so that
\begin{align} y = f(\mathbf{x}, \mathbf{w}) + \epsilon \nonumber \end{align}In matrix form, the vector of model outputs over the training set is \begin{align*} \mathbf{f} &= \mathbf{\Phi} \mathbf{w} \end{align*} where \begin{align*} \mathbf{w} &= (w_0, \ldots, w_{m-1})^\intercal \end{align*} and \begin{align*} \mathbf{\Phi} &= \begin{pmatrix} \phi(\mathbf{x}_1)^\intercal \\ \phi(\mathbf{x}_2)^\intercal \\ \vdots \\ \phi(\mathbf{x}_n)^\intercal \end{pmatrix} = \begin{pmatrix} \phi_0(\mathbf{x}_1) & \phi_1(\mathbf{x}_1) & \phi_2(\mathbf{x}_1) & \ldots & \phi_{m-1}(\mathbf{x}_1) \\ \phi_0(\mathbf{x}_2) & \phi_1(\mathbf{x}_2) & \phi_2(\mathbf{x}_2) & \ldots & \phi_{m-1}(\mathbf{x}_2) \\ \vdots & \vdots & \vdots & \ddots& \vdots\\ \phi_0(\mathbf{x}_n) & \phi_1(\mathbf{x}_n) & \phi_2(\mathbf{x}_n) & \ldots & \phi_{m-1}(\mathbf{x}_n) \end{pmatrix}_{n \times m} \end{align*} with $\phi_0(\mathbf{x}) \equiv 1$ for the intercept.
When $\phi_j(\mathbf{x}) = x_j$, we have linear regression with input matrix \begin{align} \mathbf{\Phi} = \mathbf{X}_{n \times p} = (\mathbf{x}_1, \ldots, \mathbf{x}_n)^\intercal. \nonumber \end{align}
When $p=1$ and $\phi_j(x) = x^j$, we have polynomial fitting.
Note that the matrix $\mathbf{X}$ should actually be of size $n \times (p+1)$ to accommodate the regression constant. To simplify notation, we assume $x_{i1} = 1$ $\forall i \in N$ and ignore the notational inconvenience of dimension $p+1$.
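As a concrete illustration of the setup above, the design matrix for the polynomial basis $\phi_j(x) = x^j$ can be assembled as follows (a minimal sketch; `design_matrix` is an illustrative helper, not part of the text):

```python
import numpy as np

def design_matrix(x, m):
    """Polynomial design matrix: column j holds phi_j(x) = x**j, j = 0..m-1."""
    return np.vander(x, N=m, increasing=True)

x = np.array([0.0, 0.5, 1.0])
Phi = design_matrix(x, 3)
# Each row is (1, x_i, x_i^2); the first column is the constant basis phi_0 = 1.
```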
In a frequentist setting,
- $\mathbf{w}$ is considered to be a fixed parameter, whose value is determined by some form of "estimator", and
- error bars on this estimate are obtained by considering the distribution of possible observed data sets $\mathcal{D}$.
In a Bayesian setting,
- We assume a prior probability distribution $p(\mathbf{w})$ before observing the data.
- The effect of the observed data $\mathcal{D}$ is expressed through $p(\mathcal{D}|\mathbf{w})$, i.e., likelihood function.
- Bayes' theorem \begin{align*} p(\mathbf{w}|\mathcal{D}) &= \frac{p(\mathcal{D}|\mathbf{w}) p(\mathbf{w})}{p(\mathcal{D})},\ \text{i.e., posterior} \propto \text{likelihood $\times$ prior} \end{align*}
evaluates the uncertainty in $\mathbf{w}$ after $\mathcal{D}$ is observed, where
- $p(\mathbf{w}|\mathcal{D})$ is the posterior probability after $\mathcal{D}$ is observed,
- $p(\mathcal{D})$ is the probability of evidence.
Least Squares¶
We minimize residual sum of squares (RSS) \begin{align*} \text{RSS}(\mathbf{w}) = \sum_{i=1}^{n} (y_i - f(\mathbf{x}_i, \mathbf{w}))^2 = (\mathbf{y} - \mathbf{\Phi} \mathbf{w})^\intercal(\mathbf{y} - \mathbf{\Phi}\mathbf{w}) \end{align*}
The first order condition gives \begin{align*} \mathbf{\Phi}^\intercal (\mathbf{y} - \mathbf{\Phi}\mathbf{w}) = 0 \end{align*}
and solving for $\mathbf{w}$ leads to the normal equation \begin{align*} \mathbf{\hat{w}} = (\mathbf{\Phi}^\intercal\mathbf{\Phi})^{-1} \mathbf{\Phi}^\intercal \mathbf{y} = \mathbf{\Phi}^\dagger \mathbf{y} \end{align*}
where $\mathbf{\Phi}^\dagger$ is the Moore-Penrose pseudo-inverse of the matrix $\mathbf{\Phi}$. We have
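The normal equation can be checked numerically: the explicit $(\mathbf{\Phi}^\intercal\mathbf{\Phi})^{-1}\mathbf{\Phi}^\intercal\mathbf{y}$ solution and the pseudo-inverse form agree. A minimal sketch on synthetic data (all names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(42)
n, m = 50, 4
Phi = rng.normal(size=(n, m))                    # synthetic design matrix
w_true = np.array([1.0, -2.0, 0.5, 3.0])
y = Phi @ w_true + rng.normal(scale=0.1, size=n) # targets with Gaussian noise

# Normal equation: w_hat = (Phi^T Phi)^{-1} Phi^T y
w_normal = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)
# Equivalent Moore-Penrose pseudo-inverse form: w_hat = Phi^+ y
w_pinv = np.linalg.pinv(Phi) @ y
```

In practice `np.linalg.solve` (or a QR/SVD route) is preferred over forming an explicit inverse.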
Theorem¶
$E(\mathbf{\hat{w}}) = \mathbf{w}$ and $\text{Var}(\mathbf{\hat{w}}) = (\mathbf{\Phi}^\intercal \mathbf{\Phi})^{-1} \sigma^2$.
Proof¶
Note that $y_i$ are uncorrelated with constant variance $\sigma^2$, and $\mathbf{x}_i$ are fixed.
$\mathbf{\hat{w}}$ is an unbiased estimate of $\mathbf{w}$:
\begin{align*} \text{E}(\mathbf{\hat{w}}) & = \text{E}((\mathbf{\Phi}^\intercal \mathbf{\Phi})^{-1} \mathbf{\Phi}^\intercal \mathbf{y}) = \text{E}((\mathbf{\Phi}^\intercal \mathbf{\Phi})^{-1} \mathbf{\Phi}^\intercal \mathbf{\Phi}\mathbf{w}) = (\mathbf{\Phi}^\intercal \mathbf{\Phi})^{-1} \mathbf{\Phi}^\intercal \mathbf{\Phi} \text{E}(\mathbf{w}) = \mathbf{w} \end{align*}Because
\begin{align*} \mathbf{\hat{w}} - \text{E}(\mathbf{\hat{w}}) & = (\mathbf{\Phi}^\intercal \mathbf{\Phi})^{-1} \mathbf{\Phi}^\intercal \mathbf{y} - (\mathbf{\Phi}^\intercal \mathbf{\Phi})^{-1} \mathbf{\Phi}^\intercal \mathbf{\Phi} \mathbf{w} = (\mathbf{\Phi}^\intercal \mathbf{\Phi})^{-1} \mathbf{\Phi}^\intercal (\mathbf{y} - \mathbf{\Phi} \mathbf{w}) \\ &\equiv (\mathbf{\Phi}^\intercal \mathbf{\Phi})^{-1} \mathbf{\Phi}^\intercal \epsilon \\ \text{E}(\epsilon \epsilon^\intercal) &= \text{Var}(\epsilon) = \sigma^2 \mathbf{I}_{n}, \end{align*}the variance is
\begin{align*} \text{Var}(\mathbf{\hat{w}}) &= \text{E}((\mathbf{\hat{w}} - \text{E}(\mathbf{\hat{w}}))(\mathbf{\hat{w}} - \text{E}(\mathbf{\hat{w}}))^\intercal) = (\mathbf{\Phi}^\intercal \mathbf{\Phi})^{-1} \mathbf{\Phi}^\intercal \text{E}(\epsilon \epsilon^\intercal) \mathbf{\Phi} (\mathbf{\Phi}^\intercal \mathbf{\Phi})^{-1} \\ &= (\mathbf{\Phi}^\intercal \mathbf{\Phi})^{-1} \mathbf{\Phi}^\intercal \mathbf{\Phi} (\mathbf{\Phi}^\intercal \mathbf{\Phi})^{-1} \sigma^2 = (\mathbf{\Phi}^\intercal \mathbf{\Phi})^{-1} \sigma^2 \end{align*}¶
Theorem¶
The estimated variance \begin{align*} \hat{\sigma}^2 = \frac{1}{n-p} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2, \end{align*} where $\mathbf{\hat{y}} = \mathbf{X} \mathbf{\hat{w}}$, is an unbiased estimate of the variance $\sigma^2$.
Proof¶
We need to show that $\textrm{E}(\hat{\sigma}^2) = \sigma^2$. The vector of residuals \begin{align*} \hat{\epsilon} = \mathbf{y} - \mathbf{\hat{y}} = \mathbf{y} - \mathbf{X} \mathbf{\hat{w}} = (\mathbf{I} - \mathbf{X} (\mathbf{X}^\intercal \mathbf{X})^{-1} \mathbf{X}^\intercal) \mathbf{y} \equiv Q \mathbf{y} = Q (\mathbf{X} \mathbf{w} + \epsilon) = Q \epsilon \end{align*} where the last equality follows from $Q\mathbf{X} = 0$.
Note that $\textrm{tr}(Q) = \textrm{tr}(I) - \textrm{tr}(\mathbf{X} (\mathbf{X}^\intercal \mathbf{X})^{-1} \mathbf{X}^\intercal) = n - \textrm{tr}((\mathbf{X}^\intercal \mathbf{X})^{-1} \mathbf{X}^\intercal \mathbf{X}) =n - \textrm{tr}(I_{p \times p}) = n - p$. Now, we are ready to show the estimator is unbiased.
We have $(n-p)\textrm{E}(\hat{\sigma}^2) = \textrm{E}(\hat{\epsilon}^\intercal \hat{\epsilon}) = \textrm{E}(\epsilon^\intercal Q \epsilon)$. Since $\epsilon^\intercal Q \epsilon$ is a $1 \times 1$ matrix, it equals its trace. \begin{align*} \textrm{E}(\epsilon^\intercal Q \epsilon) &= \textrm{E}(\textrm{tr}(\epsilon^\intercal Q \epsilon)) = \textrm{E}(\textrm{tr}(Q \epsilon \epsilon^\intercal)) = \textrm{tr}(Q\textrm{E}(\epsilon \epsilon^\intercal))\\ & = \textrm{tr}(Q\textrm{E}(\sigma^2 \mathbf{I})) = \sigma^2 \textrm{tr}(Q) = \sigma^2 (n-p) \end{align*}
We can show further that \begin{align*} \frac{(n-p)\hat{\sigma}^2}{\sigma^2} &\sim \chi^2_{n-p} \end{align*} which also implies $\textrm{E}(\hat{\sigma}^2) = \sigma^2$, since a $\chi^2_{n-p}$ variable has mean $n-p$.
It can be checked that $Q^\intercal=Q=Q^2$. Since $Q^2 - Q=0$, the minimal polynomial of $Q$ divides the polynomial $z^2 - z$. So, the eigenvalues of $Q$ are among 0 and 1. Since $\textrm{tr}(Q) = n - p$ is also the sum of the eigenvalues multiplied by their multiplicity, we necessarily have that 1 is an eigenvalue with multiplicity $n - p$ and zero is an eigenvalue with multiplicity $p$.
Since $Q$ is real and symmetric, there exists an orthogonal matrix $V$ with \begin{align*} V^\intercal QV = \Delta = \textrm{diag}(\underbrace{1, \ldots, 1}_{n-p \textrm{ times}}, \underbrace{0, \ldots, 0}_{p \textrm{ times}}) \end{align*}
Since $\epsilon \sim N(0, \sigma^2 \mathbf{I})$, we have $\hat{\epsilon} \sim N(0, \sigma^2 Q)$. Let $K \equiv V^\intercal \hat{\epsilon}$. Then $K \sim N(0, \sigma^2 \Delta)$ with $K_{n-p+1}=\ldots=K_n=0$ and
\begin{align*} \frac{\|K\|^2}{\sigma^2} &= \frac{\|K^{\star}\|^2}{\sigma^2} \sim \chi^2_{n-p} \end{align*}with $K^{\star} = (K_1, \ldots, K_{n-p})^\intercal$. Since $V$ is orthogonal,
\begin{align*} \|\hat{\epsilon}\|^2 &= \|K\|^2=\|K^{\star}\|^2 \end{align*}Thus, \begin{align*} \frac{(n-p)\hat{\sigma}^2}{\sigma^2} &\sim \chi^2_{n-p} \end{align*}
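The unbiasedness of $\hat{\sigma}^2$ can also be checked by simulation: fitting least squares on many independently generated data sets and averaging the variance estimates. A minimal sketch (data and names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, sigma = 50, 3, 1.0
X = rng.normal(size=(n, p))   # fixed design matrix
w = rng.normal(size=p)        # fixed true coefficients

s2 = []
for _ in range(2000):
    # draw a fresh data set: same X, new Gaussian noise
    y = X @ w + rng.normal(scale=sigma, size=n)
    w_hat = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ w_hat
    s2.append(resid @ resid / (n - p))  # sigma_hat^2 for this data set

# The average of sigma_hat^2 over many data sets should be close to sigma^2.
```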
We use least squares to fit 10 input data points with different basis functions.
x_train, t_train = create_data(sinusoidal, 10, 0.25)
x_test = np.linspace(0, 1, 100)
t_test = sinusoidal(x_test)
basis_funcs = {'Polynomial': [8], 'Gaussian': [np.linspace(0, 1, 8), 0.1], 'Sigmoidal': [np.linspace(0, 1, 8), 10]}
fig = plt.figure(figsize=(12, 8))
for i, (key, value) in enumerate(basis_funcs.items()):
    phi_train = globals()[key](*value).dm(x_train)
    phi_test = globals()[key](*value).dm(x_test)
    t, t_std = LeastSquares().fit(phi_train, t_train).predict_dist(phi_test)
    plt.subplot(2, 2, i+1)
    plt.scatter(x_train, t_train, facecolor="none", edgecolor="b", s=50, label="training data")
    plt.plot(x_test, t_test, label=r"$\sin(2\pi x)$")
    plt.plot(x_test, t, label="prediction")
    plt.fill_between(x_test, t - t_std, t + t_std, color="orange", alpha=0.5, label="std.")
    plt.title("{}".format(key))
    plt.legend()
plt.show()
The case of multiple outputs (so far we have considered only a single-dimensional target $y$ for each observation) is a natural generalization.
Geometrical Interpretation¶
The least-squares solution is obtained by finding the orthogonal projection of the data vector $\mathbf{y}$ onto the subspace spanned by the basis $\phi_j(\mathbf{x})$.

The regression can also be performed by successive orthogonalization. The algorithm is known as the Gram-Schmidt orthogonalization procedure, which is equivalent to the QR decomposition of $\mathbf{X}$. If the inputs are orthogonal, the regression coefficient for each feature can be estimated independently.
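The equivalence between the normal equations and the QR route can be sketched numerically (synthetic data; names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 4))
y = rng.normal(size=30)

# Normal-equation solution
w_ne = np.linalg.solve(X.T @ X, X.T @ y)

# QR route: X = QR with Q having orthonormal columns and R upper triangular,
# so X^T X w = X^T y reduces to the triangular system R w = Q^T y.
Q, R = np.linalg.qr(X)
w_qr = np.linalg.solve(R, Q.T @ y)
```

The QR route avoids forming $\mathbf{X}^\intercal\mathbf{X}$, which squares the condition number of the problem.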
Stochastic Gradient Descent¶
When the training data set is very large or received in a stream, a direct solution using the normal equations of the least squares may not be possible.
An alternative approach is the \textit{stochastic gradient descent} algorithm.
The total error function is a sum over the contributions of individual data points, $E = \sum_n E_n$.
- In general, the stochastic gradient descent algorithm applies the update \begin{align} \mathbf{w}^{(\tau+1)} = \mathbf{w}^{(\tau)} - \eta \nabla E_n \nonumber \end{align}
where ${\tau}$ is the iteration number and $\eta$ is a learning rate parameter.
- When the error function is the sum-of-squares function (the order of evaluation does not change the result), the update becomes \begin{align} \mathbf{w}^{(\tau+1)} = \mathbf{w}^{(\tau)} + \eta \left(y_n - \mathbf{w}^{(\tau)\intercal} \phi(\mathbf{x}_n)\right) \phi(\mathbf{x}_n) \nonumber \end{align} known as the least-mean-squares (LMS) algorithm.
Data is generated by a trigonometric function
\begin{align} t = w_0 + w_1 \sin(x) + w_2 \cos(x) \nonumber \end{align}
def trigonometric(x, w=[1.85, 0.57, 4.37]):
    return np.dot(w, [np.ones_like(x), np.sin(x), np.cos(x)])
x_trig_train, t_trig_train = create_data(trigonometric, 20, 1, [0, 4.0*np.pi])
x_trig_test = np.linspace(0, 4.0*np.pi, 100)
t_trig_test = trigonometric(x_trig_test)
# design matrix
phi = np.array([np.ones_like(x_trig_train), np.sin(x_trig_train), np.cos(x_trig_train)]).transpose()
# we set initial w = 0
w = np.zeros(phi.shape[1])
eta = 0.25
# since we have 20 samples, we use the gradient from each sample in order
for i in range(20):
    w = w + eta * (t_trig_train[i] - np.dot(w.T, phi[i])) * phi[i]
t_predicted = trigonometric(x_trig_test, w)
plt.scatter(x_trig_train, t_trig_train, facecolor="none", edgecolor="b", s=50, label="training data")
plt.plot(x_trig_test, t_trig_test, c="g", label="trigonometric")
plt.plot(x_trig_test,t_predicted, c='r', label="fitting")
plt.legend(bbox_to_anchor=(1.05, 0.27), loc=2, borderaxespad=0.5)
plt.show()
Bias-Variance Decomposition¶
We seek a function $f(\mathbf{x})$ for predicting $y$ given the input $\mathbf{x}$. Let $h(\mathbf{x})$ be the conditional expectation, also known as the regression function,
\begin{align} h(\mathbf{x}) = \mathbb{E} (y \lvert \mathbf{x} ) = \int y\, p(y \lvert \mathbf{x})\, dy \nonumber \end{align}Decision theory minimizes a loss function that penalizes errors in prediction
\begin{align*} \mathbb{E} (f) & = \int \int \left( f(\mathbf{x}) - y \right)^2 p(\mathbf{x}, y) d\mathbf{x} dy \end{align*}By conditioning on $\mathbf{x}$, we have
\begin{align*} \mathbb{E} (f) & = \mathbb{E} _{\mathbf{x}} \mathbb{E} _{y|\mathbf{x}} \left[ \left( f(\mathbf{x}) - y \right)^2 | \mathbf{x} \right] \end{align*}It suffices to minimize the expected loss pointwise
\begin{align*} h(\mathbf{x}_0) = \text{argmin}_{\alpha} \mathbb{E} _{y|\mathbf{x}} \left[ \left(y - \alpha \right)^2 | \mathbf{x} = \mathbf{x}_0 \right] = \mathbb{E} (y \lvert \mathbf{x} = \mathbf{x}_0 ) \end{align*}From a frequentist perspective, the expected loss is
\begin{align*} \mathbb{E} (L_{\mathbf{x}}) & = \int \int \left( f(\mathbf{x}) - y \right)^2 p(\mathbf{x}, y) d\mathbf{x} dy \end{align*}The squared loss above arises from decision theory, whereas the sum-of-squares error function arose in the maximum likelihood estimation of the model parameters.
Bias-Variance Decomposition¶
For a particular data set $\mathcal D$, $f(\mathbf{x})$ takes the form $f(\mathbf{x}; \mathcal D)$.
\begin{align*} \mathbb{E} (L_{\mathbf{x}}) = \underbrace{\mathbb{E}_{\mathcal D}\left[ \left( f(\mathbf{x}; \mathcal D) - \mathbb{E}_{\mathcal D}\left[ f(\mathbf{x}; \mathcal D)\right] \right)^2 \right]}_{\text{variance}} + \underbrace{\left( \mathbb{E}_{\mathcal D}\left[f(\mathbf{x}; \mathcal D)\right] - h(\mathbf{x}) \right)^2}_{\text{bias$^2$}} + \underbrace{\int \text{Var}\left[y|\mathbf{x}\right] p(\mathbf{x}) d\mathbf{x}}_{\text{noise}} \end{align*}or
\begin{equation*} \text{expected loss = variance (of prediction) + bias$^2$ (of estimate) + noise (of data)} \label{bias_var_decomp} \end{equation*}
Proof¶
\begin{align*} \mathbb{E} (L_{\mathbf{x}}) & = \int \int \left( f(\mathbf{x}) - y\right)^2 p(\mathbf{x}, y) d\mathbf{x} dy \\ & = \int \int \left( f(\mathbf{x}) - \mathbb{E} \left[ y | \mathbf{x}\right] + \mathbb{E} \left[ y | \mathbf{x}\right] - y \right)^2 p(\mathbf{x}, y) d\mathbf{x} dy \\ & = \int \int \left( ( f(\mathbf{x}) - \mathbb{E} \left[ y | \mathbf{x}\right] )^2 + \underbrace{2 ( f(\mathbf{x}) - \mathbb{E} \left[ y | \mathbf{x}\right] )(\mathbb{E} \left[ y | \mathbf{x}\right] -y)}_{\text{vanishes}} + (\mathbb{E} \left[ y | \mathbf{x}\right] - y )^2\right) p(\mathbf{x}, y) d\mathbf{x} dy \\ & = \int \left( f(\mathbf{x}) - \mathbb{E} \left[ y | \mathbf{x}\right] \right)^2 p(\mathbf{x}) d\mathbf{x} + \int \left( \mathbb{E} \left[ y | \mathbf{x}\right] -y \right)^2 p(\mathbf{x},y) d\mathbf{x} dy \\ & = \int \left( f(\mathbf{x}) - h(\mathbf{x}) \right)^2 p(\mathbf{x}) d\mathbf{x} + \underbrace{\int \text{Var}\left[y|\mathbf{x}\right] p(\mathbf{x}) d\mathbf{x}}_{\text{noise}} \end{align*}For any given data set $\mathcal{D}$, the learning algorithm produces a prediction function $f(\mathbf{x}; \mathcal{D})$.
The cross term vanishes because
\begin{align*} \int ( f(\mathbf{x}) - \mathbb{E} \left[ y | \mathbf{x}\right] ) \mathbb{E} \left[ y | \mathbf{x}\right] p(\mathbf{x},y) d\mathbf{x} dy &= \int ( f(\mathbf{x}) - \mathbb{E} \left[ y | \mathbf{x}\right] ) \mathbb{E} \left[ y | \mathbf{x}\right] p(\mathbf{x}) p(y|\mathbf{x})d\mathbf{x} dy \\ & = \int ( f(\mathbf{x}) - \mathbb{E} \left[ y | \mathbf{x}\right] ) \mathbb{E} \left[ y | \mathbf{x}\right] p(\mathbf{x}) (\int p(y|\mathbf{x}) dy)d\mathbf{x} \\ & =\int ( f(\mathbf{x}) - \mathbb{E} \left[ y | \mathbf{x}\right] ) \mathbb{E} \left[ y | \mathbf{x}\right] p(\mathbf{x}) d\mathbf{x} \\ \int ( f(\mathbf{x}) - \mathbb{E} \left[ y | \mathbf{x}\right] ) y p(\mathbf{x},y) d\mathbf{x} dy &= \int ( f(\mathbf{x}) - \mathbb{E} \left[ y | \mathbf{x}\right] ) y p(\mathbf{x}) p(y|\mathbf{x}) d\mathbf{x} dy \\ & = \int ( f(\mathbf{x}) - \mathbb{E} \left[ y | \mathbf{x}\right] ) p(\mathbf{x}) (\int y p(y|\mathbf{x}) dy ) d\mathbf{x} \\ & = \int ( f(\mathbf{x}) - \mathbb{E} \left[ y | \mathbf{x}\right] ) \mathbb{E} \left[ y | \mathbf{x}\right] p(\mathbf{x}) d\mathbf{x} \end{align*}Now, we consider \begin{align*} \int \left( f(\mathbf{x}) - h(\mathbf{x}) \right)^2 p(\mathbf{x}) d\mathbf{x} \end{align*}
We have \begin{align*} \left( f(\mathbf{x}; \mathcal D) - h(\mathbf{x}) \right)^2 =& \left( f(\mathbf{x}; \mathcal D) - \mathbb{E}_{\mathcal D}\left[ f(\mathbf{x}; \mathcal D)\right] + \mathbb{E}_{\mathcal D}\left[f(\mathbf{x}; \mathcal D)\right] - h(\mathbf{x}) \right)^2 \\ &= \left( f(\mathbf{x}; \mathcal D) - \mathbb{E}_{\mathcal D}\left[ f(\mathbf{x}; \mathcal D)\right] \right)^2+ \left(\mathbb{E}_{\mathcal D} \left[f(\mathbf{x}; \mathcal D)\right] - h(\mathbf{x}) \right)^2 \\ &\hspace{1cm} + \underbrace{2 \left( f(\mathbf{x}; \mathcal D) - \mathbb{E}_{\mathcal D}\left[f(\mathbf{x}; \mathcal D)\right] \right) \left( \mathbb{E}_{\mathcal D}\left[f(\mathbf{x}; \mathcal D)\right] - h(\mathbf{x}) \right)}_{\text{vanishes after taking $\mathbb{E}_{\mathcal D}$}} \end{align*}
We get mean squared error (MSE)
\begin{align*} \mathbb{E}_{\mathcal D}\left[\left( f(\mathbf{x}; \mathcal D) - h(\mathbf{x}) \right)^2 \right] = \underbrace{\mathbb{E}_{\mathcal D}\left[ \left( f(\mathbf{x}; \mathcal D) - \mathbb{E}_{\mathcal D}\left[ f(\mathbf{x}; \mathcal D)\right] \right)^2 \right]}_{\text{variance}} + \underbrace{\left( \mathbb{E}_{\mathcal D}\left[f(\mathbf{x}; \mathcal D)\right] - h(\mathbf{x}) \right)^2}_{\text{bias$^2$}} \end{align*}¶
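The MSE decomposition can be verified numerically: for any collection of predictions $f(\mathbf{x}; \mathcal D)$ at a fixed $\mathbf{x}$, the mean squared error splits exactly into variance plus bias$^2$. A minimal sketch with simulated predictions (the numbers are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
h = 0.7                                   # "true" value h(x) at a fixed x
f_D = 0.5 + 0.3 * rng.normal(size=10000)  # predictions f(x; D) across data sets

mse = np.mean((f_D - h) ** 2)             # E_D[(f(x; D) - h(x))^2]
variance = np.var(f_D)                    # E_D[(f(x; D) - E_D[f(x; D)])^2]
bias2 = (np.mean(f_D) - h) ** 2           # (E_D[f(x; D)] - h(x))^2
# mse == variance + bias2 holds exactly; it is an algebraic identity
```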
The bias-variance decomposition provides insight into model complexity from a frequentist perspective.
- Noise cannot be reduced because it exists in the true data-generating process.
- The bias-variance trade-off shows that very flexible models have low bias and high variance, while relatively rigid models have high bias and low variance.
- For example, the least squares estimate of $\mathbf{w}$ has the smallest variance among all linear unbiased estimates (the Gauss-Markov theorem). However, biased estimates (e.g., ridge regression) may have smaller variance.
The bias-variance trade-off is demonstrated by $k$-nearest-neighbor regression and linear regression. Consider $y = f(\mathbf{x}) + \epsilon$ where $\mathbb{E}(\epsilon) = 0$ and $\text{Var}(\epsilon) = \sigma^2$.
$k$-nearest-neighbor regression¶
For $k$-nearest-neighbor regression fit, we have
\begin{align*} \hat{f}(\mathbf{x}) = \text{Avg} (y_i | \mathbf{x}_i \in N_{k} (\mathbf{x})), \end{align*}where $N_{k} (\mathbf{x})$ is the neighborhood containing the $k$ points in the training dataset closest to $\mathbf{x}$, the expected loss has the form
\begin{align*} \mathbb{E}(L_{\mathbf{x}_0}) &= \mathbb{E} \left[ (y - \hat{f}(\mathbf{x}_0))^2| \mathbf{x} = \mathbf{x}_0 \right]\\ & = \sigma^2 + \left[ f(\mathbf{x}_0) - \frac{1}{k} \sum_{\ell = 1}^{k} f(\mathbf{x}_{(\ell)})\right]^2 + \frac{\sigma^2}{k} \end{align*}
Linear regression¶
For a linear regression model, we have
\begin{align*} \hat{f}_p(\mathbf{x}) = \mathbf{x}^\intercal \beta, \end{align*}where $\beta$ has $p$ components, we have
\begin{align*} \mathbb{E}(L_{\mathbf{x}_0}) &= \mathbb{E} \left[ (y - \hat{f}(\mathbf{x}_0))^2| \mathbf{x} = \mathbf{x}_0 \right]\\ & = \sigma^2 + \left[ f(\mathbf{x}_0) - \mathbb{E} \hat{f}_p(\mathbf{x}_0) \right]^2 + \|\mathbf{X}(\mathbf{X}^\intercal \mathbf{X})^{-1} \mathbf{x}_0 \|^2\sigma^2 \end{align*}Note the fit
\begin{align*} \hat{f}_p(\mathbf{x}_0) = \mathbf{x}_0^\intercal \beta = \mathbf{x}_0^\intercal (\mathbf{X}^\intercal \mathbf{X})^{-1} \mathbf{X}^\intercal \mathbf{y}, \end{align*}and hence
\begin{align*} \text{Var}(\hat{f}_p(\mathbf{x}_0)) = \|\mathbf{X}(\mathbf{X}^\intercal \mathbf{X})^{-1} \mathbf{x}_0 \|^2\sigma^2 \end{align*}¶
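The variance formula for the linear regression fit can be checked by Monte Carlo simulation: refit on many noise realizations and compare the empirical variance of $\hat{f}_p(\mathbf{x}_0)$ with the analytic expression. A minimal sketch (data and names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, sigma = 40, 3, 1.0
X = rng.normal(size=(n, p))   # fixed design
w = rng.normal(size=p)        # true coefficients
x0 = rng.normal(size=p)       # query point

A = np.linalg.solve(X.T @ X, X.T)  # maps y to w_hat
preds = []
for _ in range(20000):
    y = X @ w + rng.normal(scale=sigma, size=n)
    preds.append(x0 @ (A @ y))     # f_hat(x0) for this data set

analytic = np.linalg.norm(X @ np.linalg.solve(X.T @ X, x0)) ** 2 * sigma**2
empirical = np.var(preds)
```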
The bias-variance trade-off plays an important role in the frequentist view. However, the insight into model complexity is of limited practical value, because
- the bias-variance decomposition is based on averages, $\mathbb{E}_{\mathcal D}$, with respect to several data sets, whereas in practice we have only the single observed data set.
- if we had a large number of independent training sets, we would be better off combining them, which of course would reduce the level of over-fitting for a given model complexity.
There are two reasons why we are not satisfied with the least squares estimates:
- Interpretation: instead of having a large number of predictors, we would like to determine a smaller subset that exhibits the strongest effects.
- Prediction accuracy: the least squares estimates often have low bias but large variance. Prediction accuracy can sometimes be improved by shrinking or setting some coefficients to zero.
Next, we will regularize regression models with some constraints. The constraints may add bias to the model, but the resulting model will have lower variance.
Subset Selection¶
We only consider a subset of the independent variables, and eliminate the rest from the model. Least squares regression is used to estimate the coefficients of the inputs that are retained.
Best subset regression finds, for each subset size $k$, the subset that gives the smallest residual sum of squares $\text{RSS}(\mathbf{w})$.
Forward- and Backward-Stepwise Selection are greedy algorithms to find a subset of variables by producing a nested sequence of models.
Forward stepwise selection starts with the intercept, and then sequentially adds into the model the predictor that most improves the fit in terms of $\text{RSS}(\mathbf{w})$.
Backward-stepwise selection starts with the full model, and sequentially deletes the predictor that has the least impact on the fit. The candidate for dropping is the variable with the smallest Z-score.
Forward-Stagewise Selection is even more constrained than forward-stepwise selection. Each step proceeds as follows:
- Choose the $\mathbf{x}_j$ with the highest current correlation $c_j = \mathbf{x}_j^\intercal(\mathbf{y} - \mathbf{\hat{y}})$.
- Take a small step $0 < \alpha < |c_j|$ in the direction of the selected $\mathbf{x}_j$:
- $\mathbf{\hat{y}} \leftarrow \mathbf{\hat{y}} + \alpha \cdot \text{sign}(c_j) \cdot \mathbf{x}_j$
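The steps above can be sketched in a few lines of numpy on synthetic data (all names and data are illustrative; here $\alpha$ is a small fixed step size):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 5
X = rng.normal(size=(n, p))
X /= np.linalg.norm(X, axis=0)                 # unit-norm columns
w_true = np.array([3.0, -2.0, 0.0, 0.0, 1.0])
y = X @ w_true + rng.normal(scale=0.1, size=n)

w = np.zeros(p)
y_hat = np.zeros(n)
alpha = 0.01                                   # small step size
c0 = np.abs(X.T @ y).max()                     # initial max correlation
for _ in range(2000):
    c = X.T @ (y - y_hat)                      # current correlations
    j = np.argmax(np.abs(c))                   # most correlated predictor
    w[j] += alpha * np.sign(c[j])              # small step in its direction
    y_hat = X @ w
cf = np.abs(X.T @ (y - y_hat)).max()           # residual correlations shrink
```

After many small steps the residual is nearly uncorrelated with all predictors, approaching the least squares fit.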
By retaining a subset of the predictors and discarding the rest, subset selection produces a model that is interpretable and has possibly lower prediction error than the full model. However, because it is a discrete process in which variables are either retained or discarded, it often exhibits high variance, and sometimes does not reduce the prediction error of the full model.
Shrinkage Methods¶
Shrinkage methods are more continuous: they shrink the regression coefficients by imposing a penalty on their size, and do not suffer as much from high variability.
\begin{align*} E_{\mathcal{D}}(\mathbf{w}) + \lambda E_{\mathbf{w}}(\mathbf{w}) & = \frac{1}{2} \sum_{i=1}^{n} (y_i - \mathbf{w}^\intercal \phi(x_{i}))^2 + \frac{\lambda}{2} \sum_{j=0}^{m-1} \lvert w_j \rvert^q \\ & = \frac{1}{2} \left( \mathbf{y}^\intercal \mathbf{y} - 2 (\mathbf{\Phi} \mathbf{w})^\intercal \mathbf{y} + (\mathbf{\Phi} \mathbf{w})^\intercal \mathbf{\Phi} \mathbf{w} \right) + \frac{\lambda}{2} \lVert \mathbf{w} \rVert_q^q \end{align*}It is equivalent to minimizing the unregularized sum-of-squares error subject to the constraint
\begin{align*} \sum_{j=0}^{m-1} \lvert w_j \rvert^q \le \eta. \end{align*}Examples:
- $q=0$ corresponds to variable subset selection,
- $q=1$ is known as the lasso (Least Absolute Shrinkage and Selection Operator), and
- $q=2$ corresponds to ridge regression, with closed-form solution \begin{align*} \mathbf{\hat{w}}^{\text{ridge}} = (\mathbf{\Phi}^\intercal\mathbf{\Phi} + \lambda \mathbf{I})^{-1} \mathbf{\Phi}^\intercal \mathbf{y}. \end{align*}
One might consider estimating $q$ from the data; however, it is usually not worth the effort for the extra variance incurred. Below we use ridge regression with three different values of the regularization coefficient for a polynomial fit.
x_train, t_train = create_data(sinusoidal, 10, 0.25)
x_test = np.linspace(0, 1, 100)
t_test = sinusoidal(x_test)
# set degree = 9 because it's the most complicated polynomial model for 10 data points
phi_train = Polynomial(9).dm(x_train)
phi_test = Polynomial(9).dm(x_test)
fig = plt.figure(figsize=(15, 4))
for i, alpha in enumerate([0, 1e-3, 1]):
    # alpha is the regularization coefficient
    t = LeastSquares(alpha).fit(phi_train, t_train).predict(phi_test)
    plt.subplot(1, 3, i+1)
    plt.scatter(x_train, t_train, facecolor="none", edgecolor="b", s=50, label="training data")
    plt.plot(x_test, t_test, c="g", label=r"$\sin(2\pi x)$")
    plt.plot(x_test, t, c="r", label="fitting")
    plt.ylim(-1.5, 1.5)
    plt.legend(loc=1)
    plt.annotate(r"n={}, $\lambda$={}".format(9, alpha), xy=(0.05, -1.2))
plt.show()
Bias-Variance Trade-off¶
We illustrate bias and variance with a regularization parameter $\ln \lambda \in \{2.6, -0.31, -2.4\}$.
In x_train, t_train, we generate 100 datasets, each containing 25 sinusoidal data points.
- Graphs in the first row demonstrate variance,
where each curve is a fit $f(\mathbf{x}; \mathcal D)$ for a given dataset $\mathcal D$ (for clarity, only 20 of the 100 fits are shown). Their average $\mathbb{E}_{\mathcal D}\left[ f(\mathbf{x}; \mathcal D)\right]$ gives the fitting curve in the graphs of the second row.
- Thus, graphs in the second row demonstrate bias$^2$
where $h(\mathbf{x})$ is the optimal prediction, i.e., $\sin (2 \pi x)$ in our case.
The last graph shows variance and bias$^2$ by varying $\lambda$.
funcs = {'Polynomial': [24], 'Gaussian': [np.linspace(0, 1, 24), 0.1], 'Sigmoidal': [np.linspace(0, 1, 24), 10]}
x_test = np.linspace(0, 1, 100)
t_test = sinusoidal(x_test)
def bias_and_var(fname):
    phi_test = globals()[fname](*funcs[fname]).dm(x_test)
    plt.figure(figsize=(12, 6))
    for loc, alpha in enumerate([2.6, -0.31, -2.4]):
        plt.subplot(2, 3, loc+1)
        # We have 100 data sets, each having 25 data points
        x_train, t_train = zip(*[create_data(sinusoidal, 25, 0.8) for i in range(100)])
        phi_train = [globals()[fname](*funcs[fname]).dm(x) for x in x_train]
        # t[i] is an array containing the prediction values for x_test, where
        # the fitting function is derived from the i-th data set
        t = [LeastSquares(np.exp(alpha)).fit(phi, t).predict(phi_test) for phi, t in zip(phi_train, t_train)]
        # For clarity, only 20 of the 100 fits are shown
        for i in range(20):
            plt.plot(x_test, t[i], c="orange")
        plt.annotate(r"$\ln \lambda={}$".format(alpha), xy=(0.05, -1.2))
        plt.ylim(-1.5, 1.5)
        plt.subplot(2, 3, loc+4)
        plt.plot(x_test, t_test, c="g", label=r"$\sin(2\pi x)$")
        plt.plot(x_test, np.asarray(t).mean(axis=0), c="r", label="fitting")
        plt.ylim(-1.5, 1.5)
        plt.legend()
    plt.show()
    plt.figure(figsize=(6, 5))
    x_train, t_train = zip(*[create_data(sinusoidal, 25, 0.8) for i in range(100)])
    phi_train = [globals()[fname](*funcs[fname]).dm(x) for x in x_train]
    alpha_range = np.linspace(-2.6, 1.5, 30)
    t = [[LeastSquares(np.exp(alpha)).fit(phi, t).predict(phi_test) for phi, t in zip(phi_train, t_train)]
         for alpha in alpha_range]
    bias2 = ((np.asarray(t).mean(axis=1) - t_test) ** 2).mean(axis=1)  #(3.46)
    var = np.array(t).var(axis=1).mean(axis=1)  #(3.47)
    plt.ylim(0, (bias2+var).max()+.01)
    plt.plot(alpha_range, bias2, label=r"bias$^2$")
    plt.plot(alpha_range, var, label="var")
    plt.plot(alpha_range, bias2+var, label=r"bias$^2$+var")
    plt.xlabel(r'$\ln \lambda$')
    plt.legend()
    plt.show()
bias_and_var('Gaussian')
Singular Value Decomposition¶
The singular value decomposition (SVD) of $\mathbf{X}_{n \times p} = (\mathbf{x}_1, \ldots, \mathbf{x}_n)^\intercal$ has the form $\mathbf{X} = U D V^\intercal$, where
- $U_{n \times p}$ has orthonormal columns spanning the column space of $\mathbf{X}$,
- $V_{p \times p}$ is orthogonal and its columns span the row space of $\mathbf{X}$, and
- $D_{p \times p}$ is diagonal with entries $d_1 \ge d_2 \ge \cdots \ge d_p \ge 0$, the singular values.
We have
\begin{align*} \mathbf{X} \mathbf{w}^{\text{ls}} = UU^\intercal \mathbf{y} = \sum_{j=1}^{p} \mathbf{u}_j (\mathbf{u}^\intercal_j \mathbf{y}) \end{align*}and
\begin{align*} \mathbf{X} \mathbf{w}^{\text{ridge}} = \sum_{j=1}^{p} \mathbf{u}_j \left(\frac{d^2_j}{d^2_j + \lambda} (\mathbf{u}^\intercal_j \mathbf{y}) \right) \end{align*}where
- $\mathbf{u}_j$ is a normalized principal component,
- Ridge shrinks more with smaller $d^2_j$ (small variance),
- Ridge improves interpretability by putting low weights on small variance components.

Note that $\mathbf{X}^\intercal \mathbf{X} = VDU^\intercal UDV^\intercal = VD^2V^\intercal$. Thus
\begin{align*} \mathbf{X} \mathbf{w}^{\text{ls}} &= UDV^\intercal (VD^2V^\intercal)^{-1} VDU^\intercal \mathbf{y} = UDV^\intercal (V^{-\intercal}D^{-2}V^{-1}) VDU^\intercal \mathbf{y} \\ &= UU^\intercal \mathbf{y} = \sum_{j=1}^{p} \mathbf{u}_j (\mathbf{u}^\intercal_j \mathbf{y}) \\ \mathbf{w}^{\text{ridge}} &= (\mathbf{X}^\intercal \mathbf{X} + \lambda \mathbf{I})^{-1} \mathbf{X}^\intercal \mathbf{y} = (VD^2V^\intercal + \lambda VV^\intercal)^{-1}VDU^\intercal\mathbf{y} \\ &= (V(D^2+\lambda \mathbf{I})V^\intercal)^{-1}VDU^\intercal\mathbf{y} \\ &= V^{-\intercal}(D^2+\lambda \mathbf{I})^{-1}DU^\intercal\mathbf{y} \\ \mathbf{X} \mathbf{w}^{\text{ridge}} &=UD(D^2+\lambda \mathbf{I})^{-1}DU^\intercal\mathbf{y} \end{align*}The matrix $D(D^2+\lambda \mathbf{I})^{-1}D$ is diagonal with elements given by
\begin{align*} \frac{d^2_j}{d^2_j + \lambda} \end{align*}Thus,
\begin{align*} \mathbf{X} \mathbf{w}^{\text{ridge}} &= \sum_{j=1}^{p} \mathbf{u}_j \left(\frac{d^2_j}{d^2_j + \lambda} (\mathbf{u}^\intercal_j \mathbf{y}) \right) \end{align*}
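The shrinkage formula can be verified numerically against the direct ridge solution (a sketch on synthetic data; names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)
n, p, lam = 30, 4, 0.5
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

# Direct ridge fit: w = (X^T X + lambda I)^{-1} X^T y
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
fit_direct = X @ w_ridge

# SVD shrinkage form: sum_j u_j * (d_j^2 / (d_j^2 + lambda)) * (u_j^T y)
U, d, Vt = np.linalg.svd(X, full_matrices=False)
fit_svd = U @ (d**2 / (d**2 + lam) * (U.T @ y))
```

Each principal direction is shrunk by the factor $d_j^2/(d_j^2+\lambda)$, so low-variance directions (small $d_j$) are damped the most.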